About the project

Introduction to Open Data Science 2019

RStudio Exercise 1: Tasks and Instructions

I joined this course to discover the world of data science. Besides, I would like to make friends with people who work in health science. I am currently a PhD student at the University of Eastern Finland.

My repository link

My course diary


Chapter 2

Regression and model validation - exercise 2

Data wrangling step

Prepare the data from the original file. The R script is here and the data is here.

First, install the required packages:

install.packages(c("dplyr", "GGally", "tidyverse", "xlsx", "ggplot2"))

Question 1: Reading data

library(dplyr)
library(ggplot2)
library(readxl)

# read the wrangled data and turn gender into a factor
learning2014 <- readxl::read_excel("data/learning2014.xlsx") %>%
  mutate_at(vars(gender), factor)

dim(learning2014)
head(learning2014)
str(learning2014)

Question 2: Describe the dataset, give a graphical overview of the data, and summarize the variables

The data includes seven variables: gender, age, attitude, deep, stra, surf, and points:

  • gender: M (Male), F (Female)

  • age: age (in years) derived from the date of birth

  • attitude: global attitude toward statistics

  • deep: average score of the deep-learning questions

  • stra: average score of the strategic-learning questions

  • surf: average score of the surface-learning questions

  • points: exam points

The full data is available from https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt

View(learning2014)
# describe some variables
as.numeric(learning2014$age)
##   [1] 53 55 49 53 49 38 50 37 37 42 37 34 34 34 35 33 32 44 29 30 27 29 31
##  [24] 37 26 26 30 33 33 28 26 27 25 31 20 39 38 24 26 25 30 25 30 48 24 40
##  [47] 25 23 25 23 27 25 23 23 23 45 22 23 21 21 21 21 26 25 26 21 23 22 22
##  [70] 22 23 22 23 24 22 23 22 20 22 21 22 23 21 22 29 29 21 28 21 30 21 23
##  [93] 21 21 20 22 21 21 20 22 20 20 24 20 19 21 21 22 25 21 22 25 20 24 20
## [116] 21 20 20 31 20 22 22 21 22 21 21 21 21 20 21 25 21 24 20 21 20 20 18
## [139] 21 19 21 20 21 20 21 20 18 22 22 24 19 20 20 17 19 20 20 20 20 19 19
## [162] 22 35 18 19 21
summaryvar <- summary(learning2014)
print(summaryvar)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(ggplot2)
library(tidyverse)
## -- Attaching packages ------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# see the correlations between the variables
pairs.panels(learning2014)

Commenting on the distributions of the variables and the relationships between them

Each panel shows the relationship between the variable on the horizontal axis and the variable on the vertical axis of the grid. Gender is a discrete variable, so we focus on the continuous variables: age, attitude, deep, stra, surf, and points.

For example, the correlation between age and attitude is 0.02. Attitude toward statistics correlates positively with age, deep learning, and strategic learning, but negatively with surface learning.

The diagonal shows a histogram of each variable, and the graph shows that points and attitude have the strongest correlation (r = 0.44).

Question 3

library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, 
        mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20))
        )

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

Test the association between points and attitude, deep, and stra:

cor.test(learning2014$attitude,learning2014$points)
## 
##  Pearson's product-moment correlation
## 
## data:  learning2014$attitude and learning2014$points
## t = 6.2135, df = 164, p-value = 4.119e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3044463 0.5521335
## sample estimates:
##       cor 
## 0.4365245
cor.test(learning2014$deep,learning2014$points)
## 
##  Pearson's product-moment correlation
## 
## data:  learning2014$deep and learning2014$points
## t = -0.12997, df = 164, p-value = 0.8967
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1622192  0.1423931
## sample estimates:
##         cor 
## -0.01014849
cor.test(learning2014$stra,learning2014$points)
## 
##  Pearson's product-moment correlation
## 
## data:  learning2014$stra and learning2014$points
## t = 1.8916, df = 164, p-value = 0.06031
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006340149  0.291945961
## sample estimates:
##       cor 
## 0.1461225

The correlation between attitude and points is 0.4365245 (p-value = 4.119e-09).

The correlation between deep and points is -0.01014849 (p-value = 0.8967).

The correlation between stra and points is 0.1461225 (p-value = 0.06031).

Looking at the p-values, the associations of points with deep and with stra are not statistically significant, so deep and stra are removed from the model.
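This backward-elimination step can be sketched in general form. The snippet below is a toy illustration on the built-in mtcars data (not the course data), showing the same workflow of fitting a full model, inspecting p-values, and dropping a predictor with update():

```r
# Toy illustration of backward elimination on built-in data (mtcars),
# mirroring the workflow used for learning2014 above.
m_full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# inspect the p-values of the coefficients
summary(m_full)$coefficients[, "Pr(>|t|)"]

# remove one predictor and refit without retyping the formula
m_reduced <- update(m_full, . ~ . - qsec)
coef(m_reduced)
```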

Question 4: Relationship between attitude and points

p1 <- ggplot(learning2014, aes(x = attitude, y = points))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3 + ggtitle("Relation between attitude and points")
print(p4)

A linear model is fitted to the data, with points explained by attitude.

The equation for the model is \[ Y_i = \alpha + \beta_1 X_i + \epsilon_i \] where Y represents points, X is attitude, \(\alpha\) is the intercept (constant), \(\beta_1\) is the regression coefficient for attitude, and \(\epsilon\) is a random error term.

m1 <- lm(points ~ attitude, data = learning2014)
results <- summary(m1)
knitr::kable(results$coefficients, digits = 3, caption = "Regression coefficients")

Regression coefficients

                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)       11.637        1.830     6.358          0
attitude           3.525        0.567     6.214          0

Both the intercept and attitude are statistically significant predictors. The coefficient of determination \(R^2\) = 0.1905537, which is not particularly high: attitude alone explains about 19% of the variation in exam points.
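The \(R^2\) value can be read programmatically from the model summary. A minimal sketch, using the built-in cars data as a stand-in for the course data:

```r
# summary() of an lm fit stores the coefficient of determination in $r.squared
m_toy <- lm(dist ~ speed, data = cars)  # stand-in for m1 on learning2014
summary(m_toy)$r.squared
summary(m_toy)$adj.r.squared  # adjusted version, penalized for model size
```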

Question 5: Diagnostic plots.

plot(m1, which = c(1, 2, 5))

These are the Residuals vs Fitted plot (checking linearity and constant error variance), the Normal Q-Q plot (checking normality of the residuals), and the Residuals vs Leverage plot (flagging influential observations).


Chapter 3

Q2: Read the data and print out the names of the variables in the data

Describe the data set briefly

This data set concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades, demographic, social, and school-related features, and it was collected using school reports and questionnaires. The data sets provide information on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two data sets were modeled under binary/five-level classification and regression tasks.

Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the source paper for details). The joined data of the two data sets (n = 382) has 35 different variables.

The variables include:

  • school: binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira

  • Characteristics of participants: sex, age

  • Demographic and family information: address, famsize, Pstatus, Medu, Fedu

  • Educational information: absences (numeric: number of school absences), failures (number of past class failures), nursery (attended nursery school: yes or no), internet (Internet access at home: yes or no), guardian (student's guardian: 'mother', 'father' or 'other')

  • G1, G2, G3: grades related to the course subject, Math or Portuguese

Q3: Choose 4 interesting variables in the data and predict their association with high_use

I want to see how age (age), sex (sex), being in a romantic relationship (romantic), and the number of school absences (absences) affect alcohol consumption.

The hypothesis is that being older, being male, not being in a romantic relationship, and more days of school absence all increase alcohol consumption.

Q4: Numerically and graphically explore the distributions of chosen variables

g1 <- ggplot(data = dat, aes(x = high_use, fill = sex))
g1 + geom_bar() 

Plot 1 shows that males make up a higher proportion of the high-use group than females.

g2 <- ggplot(data = dat, aes(x = high_use, fill =  romantic ))
g2 + geom_bar()

Students not in a romantic relationship make up a larger proportion of the high-alcohol-use group than those in a relationship.

g3 <- ggplot(data = dat, aes(x = high_use, y= absences))
g3 + geom_boxplot()

More days of absence from school are associated with higher alcohol consumption.
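The proportions behind these bar plots can also be tabulated numerically. A small sketch with a toy data frame standing in for `dat` (which is not reproduced here):

```r
# Toy stand-in for `dat`: high_use by sex
toy <- data.frame(
  high_use = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  sex      = c("M", "M", "F", "F", "M", "F")
)

# share of high users within each sex (rows sum to 1)
prop.table(table(toy$sex, toy$high_use), margin = 1)
```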

Q5: Logistic regression model

m <- glm(high_use ~ age+ sex + romantic+ absences, data = dat, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + romantic + absences, family = "binomial", 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7179  -0.8456  -0.6286   1.0823   2.1893  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.54258    1.70433  -2.665  0.00769 ** 
## age          0.16915    0.10226   1.654  0.09810 .  
## sexM         0.99159    0.24303   4.080 4.50e-05 ***
## romanticyes -0.29584    0.26585  -1.113  0.26579    
## absences     0.07283    0.01828   3.983 6.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 422.16  on 377  degrees of freedom
## AIC: 432.16
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind (OR, CI)
##                     OR        2.5 %   97.5 %
## (Intercept) 0.01064589 0.0003570945 0.290015
## age         1.18430022 0.9701859445 1.450229
## sexM        2.69551085 1.6842578252 4.375275
## romanticyes 0.74390627 0.4371001910 1.242806
## absences    1.07554516 1.0398995842 1.116863

Interpreting the results:

  • age: being older is associated with higher odds of high alcohol use (OR = 1.18), but the effect is not statistically significant (p-value = 0.10).

  • sex: males have about 2.7 times the odds of high use compared with females (OR = 2.70, p < 0.001).

  • romantic: being in a romantic relationship is associated with lower odds of high use (OR = 0.74), but the effect is not significant (p = 0.27).

  • absences: each additional school absence increases the odds of high use by about 8% (OR = 1.08, p < 0.001).

Q6: Prediction analysis

Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

probabilities <- predict(m, type = "response")

dat <- mutate(dat, probability = probabilities)

dat <- mutate(dat, prediction = probabilities > 0.5)

# tabulate the target variable versus the predictions

table(high_use = dat$high_use, prediction = dat$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   261    9
##    TRUE     88   24

The model classifies 261 + 24 = 285 of the 382 students correctly, so the training error is (9 + 88)/382 ≈ 0.254. A simple guessing strategy that always predicts FALSE (the majority class) would misclassify 112/382 ≈ 0.293, so the model performs somewhat better than guessing.
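The training error and a majority-class baseline can be computed directly from the counts in the cross-tabulation above:

```r
# counts from the cross-tabulation above
tn <- 261; fp <- 9    # actual FALSE
fn <- 88;  tp <- 24   # actual TRUE
n  <- tn + fp + fn + tp

training_error <- (fp + fn) / n   # proportion of wrong predictions
baseline_error <- (tp + fn) / n   # always guessing FALSE (majority class)
round(c(training_error, baseline_error), 3)
```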

# loss function: proportion of wrong predictions
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
cv <- cv.glm(data = dat, cost = loss_func, glmfit = m, K = nrow(dat))
# the leave-one-out cross-validation error is stored in delta
cv$delta[1]
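Since `dat` is not available outside this diary, the same leave-one-out cross-validation pattern is sketched below on the built-in mtcars data (with am as a binary outcome); cv$delta[1] holds the average prediction error:

```r
library(boot)

# same loss function as above: proportion of misclassified observations
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# toy logistic model on built-in data, standing in for `m` on `dat`
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")

cv_toy <- cv.glm(data = mtcars, cost = loss_func, glmfit = m_toy, K = nrow(mtcars))
cv_toy$delta[1]  # leave-one-out CV error, a proportion between 0 and 1
```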

Chapter 4

Question 2: Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
library(corrplot)
## corrplot 0.84 loaded
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# load the Boston dataset from MASS
data("Boston")

# look at the data
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
#look at the structure of dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Data summary: This is a dataset about housing values in the suburbs of Boston. The Boston data frame has 506 rows and 14 columns, with the variables explained below:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in $1000s.

Question 3: A graphical overview of the data and summaries of the variables in the data

##       crim               zn              indus              chas         
##  Min.   :-0.3900   Min.   :-0.5700   Min.   :-0.7100   Min.   :-0.12000  
##  1st Qu.:-0.2150   1st Qu.:-0.4050   1st Qu.:-0.3825   1st Qu.:-0.04750  
##  Median : 0.3200   Median :-0.2550   Median : 0.3950   Median : 0.02000  
##  Mean   : 0.1786   Mean   :-0.0550   Mean   : 0.1929   Mean   : 0.08143  
##  3rd Qu.: 0.4500   3rd Qu.: 0.2775   3rd Qu.: 0.6300   3rd Qu.: 0.09000  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##       nox               rm                age               dis         
##  Min.   :-0.770   Min.   :-0.61000   Min.   :-0.7500   Min.   :-0.7700  
##  1st Qu.:-0.360   1st Qu.:-0.29750   1st Qu.:-0.2625   1st Qu.:-0.5225  
##  Median : 0.305   Median :-0.21500   Median : 0.3050   Median :-0.3050  
##  Mean   : 0.190   Mean   :-0.01286   Mean   : 0.1736   Mean   :-0.1464  
##  3rd Qu.: 0.655   3rd Qu.: 0.19000   3rd Qu.: 0.5775   3rd Qu.: 0.2400  
##  Max.   : 1.000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.0000  
##       rad               tax             ptratio            black         
##  Min.   :-0.4900   Min.   :-0.5300   Min.   :-0.5100   Min.   :-0.44000  
##  1st Qu.:-0.2850   1st Qu.:-0.3050   1st Qu.:-0.2175   1st Qu.:-0.37750  
##  Median : 0.4600   Median : 0.4850   Median : 0.2250   Median :-0.22500  
##  Mean   : 0.2371   Mean   : 0.2364   Mean   : 0.1157   Mean   :-0.06071  
##  3rd Qu.: 0.6075   3rd Qu.: 0.6475   3rd Qu.: 0.3775   3rd Qu.: 0.16750  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##      lstat              medv         
##  Min.   :-0.7400   Min.   :-0.74000  
##  1st Qu.:-0.4000   1st Qu.:-0.46000  
##  Median : 0.4150   Median :-0.38000  
##  Mean   : 0.1407   Mean   :-0.06857  
##  3rd Qu.: 0.5775   3rd Qu.: 0.31000  
##  Max.   : 1.0000   Max.   : 1.00000

Figure 1 shows the distributions of the variables and their relationships. For example, chas is a categorical variable.

Question 4: Standardizing the data

# scale the variables
boston_scaled <- scale(Boston)
summary(boston_scaled %>% round(digits = 2))
##       crim                 zn                indus          
##  Min.   :-0.420000   Min.   :-0.490000   Min.   :-1.560000  
##  1st Qu.:-0.410000   1st Qu.:-0.490000   1st Qu.:-0.870000  
##  Median :-0.390000   Median :-0.490000   Median :-0.210000  
##  Mean   :-0.000119   Mean   :-0.002154   Mean   :-0.001304  
##  3rd Qu.: 0.010000   3rd Qu.: 0.050000   3rd Qu.: 1.010000  
##  Max.   : 9.920000   Max.   : 3.800000   Max.   : 2.420000  
##       chas                nox                  rm           
##  Min.   :-0.270000   Min.   :-1.46e+00   Min.   :-3.880000  
##  1st Qu.:-0.270000   1st Qu.:-9.10e-01   1st Qu.:-0.570000  
##  Median :-0.270000   Median :-1.40e-01   Median :-0.110000  
##  Mean   : 0.001838   Mean   : 5.93e-05   Mean   : 0.000138  
##  3rd Qu.:-0.270000   3rd Qu.: 6.00e-01   3rd Qu.: 0.480000  
##  Max.   : 3.660000   Max.   : 2.73e+00   Max.   : 3.550000  
##       age                 dis                 rad           
##  Min.   :-2.330000   Min.   :-1.270000   Min.   :-9.80e-01  
##  1st Qu.:-0.837500   1st Qu.:-0.800000   1st Qu.:-6.40e-01  
##  Median : 0.315000   Median :-0.280000   Median :-5.20e-01  
##  Mean   : 0.000415   Mean   :-0.000138   Mean   : 5.93e-05  
##  3rd Qu.: 0.907500   3rd Qu.: 0.660000   3rd Qu.: 1.66e+00  
##  Max.   : 1.120000   Max.   : 3.960000   Max.   : 1.66e+00  
##       tax                ptratio             black          
##  Min.   :-1.3100000   Min.   :-2.70000   Min.   :-3.900000  
##  1st Qu.:-0.7700000   1st Qu.:-0.49000   1st Qu.: 0.202500  
##  Median :-0.4600000   Median : 0.27500   Median : 0.380000  
##  Mean   : 0.0001383   Mean   : 0.00164   Mean   :-0.000079  
##  3rd Qu.: 1.5300000   3rd Qu.: 0.81000   3rd Qu.: 0.430000  
##  Max.   : 1.8000000   Max.   : 1.64000   Max.   : 0.440000  
##      lstat               medv           
##  Min.   :-1.53000   Min.   :-1.9100000  
##  1st Qu.:-0.79750   1st Qu.:-0.5975000  
##  Median :-0.18000   Median :-0.1400000  
##  Mean   : 0.00002   Mean   : 0.0000988  
##  3rd Qu.: 0.60000   3rd Qu.: 0.2700000  
##  Max.   : 3.55000   Max.   : 2.9900000
# see the class of the scaled object
class(boston_scaled)
## [1] "matrix"
#change boston_scaled from matrix into data frame
boston_scaled <-as.data.frame(boston_scaled)
summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# look at the quantile break points
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins,
             labels = c("low", "med_low", "med_high", "high"),
             include.lowest = TRUE)
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

After scaling, all variable means move to 0 (and standard deviations to 1). The crime variable is now cut into 4 categories according to how high the per capita crime rate by town is, and the data is split randomly into a train set (80% of the rows) and a test set (the remaining 20%).
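What scale() does can be verified by hand on a small vector: it subtracts each column's mean and divides by its standard deviation.

```r
x <- c(2, 4, 6, 8, 10)
x_scaled <- (x - mean(x)) / sd(x)   # the same arithmetic scale() applies

all.equal(as.numeric(scale(x)), x_scaled)    # the two agree
c(mean = mean(x_scaled), sd = sd(x_scaled))  # mean 0, sd 1
```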

Question 5: Linear discriminant analysis

# linear discriminant analysis with target variable crime
lda_crime <- lda(crime ~ ., data = train)
lda_crime
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2673267 0.2425743 0.2475248 0.2425743 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9495409 -0.9199965 -0.12651054 -0.8639591  0.43404501
## med_low  -0.0680943 -0.3348050 -0.03128211 -0.5772463 -0.10626698
## med_high -0.3929105  0.2309388  0.08200995  0.4193641  0.05619791
## high     -0.4872402  1.0149946 -0.03128211  1.0506227 -0.43184272
##                 age        dis        rad        tax     ptratio
## low      -0.8451067  0.8750523 -0.6851839 -0.7352272 -0.40629552
## med_low  -0.3744090  0.3926826 -0.5365473 -0.4405401 -0.06241517
## med_high  0.4518157 -0.4274656 -0.4110831 -0.2988490 -0.32080881
## high      0.8062673 -0.8539045  1.6596029  1.5294129  0.80577843
##                black       lstat         medv
## low       0.38275292 -0.75645050  0.481992416
## med_low   0.30585520 -0.16128368 -0.005896954
## med_high  0.06399668  0.06575168  0.134411187
## high     -0.85153829  0.87940650 -0.714305390
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.117462952  0.545422837 -1.03471026
## indus    0.002425109 -0.397923258  0.14855148
## chas    -0.087693725  0.032985312  0.10848084
## nox      0.275768413 -0.763331270 -1.20273382
## rm      -0.116106199 -0.043194029 -0.13947095
## age      0.293425740 -0.296062208 -0.16136188
## dis     -0.098528392 -0.167558107  0.28843272
## rad      3.367732753  0.833063749 -0.38742641
## tax      0.073670474  0.183675900  0.91067114
## ptratio  0.151042926  0.063378885 -0.22150150
## black   -0.160182566  0.004527969  0.09739340
## lstat    0.174288616 -0.273400891  0.46644625
## medv     0.171870311 -0.470455401 -0.00897782
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9516 0.0372 0.0111
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
# plot lda with 2 discriminants
plot(lda_crime, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_crime, myscale = 1)

# singular values (ratios of between- to within-group standard deviations)
lda_crime$svd
## [1] 44.318707  8.767059  4.792523

The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (here, the Boston variables) that give the best possible separation between the groups.

We separate crime into 4 groups ("low", "med_low", "med_high", "high"), so the number of groups is G = 4 and the number of predictor variables is p = 13. The maximum number of useful discriminant functions is the minimum of G - 1 and p; thus the result contains 3 discriminant functions (LD1, LD2, LD3).

The higher the ratio of between- to within-group standard deviations on a discriminant variable, the better that discriminant separates the groups. The proportion of trace for LD1 is 0.95, meaning the first discriminant alone captures about 95% of the between-group variance.

In the above picture we can see the linear combination of the variables that separate the crime classes.
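The proportion of trace printed by lda() can be recovered from the singular values: squaring them and normalizing gives each discriminant's share of the between-group variance.

```r
# singular values reported by lda() for this fit (copied from the output above)
svd_vals <- c(44.318707, 8.767059, 4.792523)

# proportion of between-group variance explained by each discriminant
round(svd_vals^2 / sum(svd_vals^2), 4)
```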

Question 6) Predicting classes with the LDA model on the test data

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda_crime, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       3        2    0
##   med_low    4      19        5    0
##   med_high   0      12       13    1
##   high       0       0        1   28

The table above shows how the model predicts the crime classes on the test data. The model predicts the "high" and "low" classes well, while the two middle classes are confused with each other more often.
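Overall test-set accuracy follows from the confusion table above: the correct predictions sit on the diagonal.

```r
# confusion table from above (rows = correct class, columns = predicted class)
lvls <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(14, 4, 0, 0,
                 3, 19, 12, 0,
                 2, 5, 13, 1,
                 0, 0, 1, 28),
               nrow = 4, dimnames = list(correct = lvls, predicted = lvls))

sum(diag(conf)) / sum(conf)  # share of test observations classified correctly
```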

Question 7) Standardizing the dataset and running the k-means algorithm

Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)

# Euclidean distances between observations (computed here on the unscaled data)
dist_eu <- dist(Boston)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047
dist_man <- dist(Boston, method = "manhattan")
summary(dist_man)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.016  149.145  279.505  342.899  509.707 1198.265
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

# focus on columns 6-10
pairs(Boston[6:10], col = km$cluster)

#K-means: determine the k
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

According to the qplot above, the optimal number of clusters is 2, because the total WCSS drops most dramatically there and levels off afterwards (the elbow of the curve).

km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

pairs(Boston[1:5] , col = km$cluster)

pairs(Boston[6:10] , col = km$cluster)

pairs(Boston[11:14] , col = km$cluster)

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda_crime$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_crime$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')